Introduction

Structured Query Language (SQL) is a powerful tool for managing and manipulating relational databases. It is essential for data scientists, analysts, and anyone working with large datasets. This chapter explores why SQL matters, where it is applied, and provides worked code examples to illustrate its utility.

Why SQL is Important

  1. Data Management: SQL allows for efficient management of large volumes of data. It provides the means to create, read, update, and delete data in a relational database (a minimal sketch of these four operations follows this list).
  2. Data Retrieval: With SQL, you can perform complex queries to retrieve specific data from one or more tables, making it easier to analyze and draw insights.
  3. Data Manipulation: SQL enables the manipulation of data through operations such as sorting, filtering, and aggregating. This is crucial for data cleaning and preprocessing.
  4. Data Integration: SQL supports the integration of data from different sources, allowing for comprehensive data analysis.
  5. Standardization: SQL is a standardized language used by most relational database management systems (RDBMS), making it a versatile and essential skill for professionals in the field.
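
As a minimal sketch of the four data-management operations in point 1, here are the corresponding statements. The people table used here is purely illustrative and is not used elsewhere in the chapter.

-- Create a table, then add, read, change, and remove a row
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO people (id, name) VALUES (1, 'Ada');     -- create
SELECT id, name FROM people;                         -- read
UPDATE people SET name = 'Ada L.' WHERE id = 1;      -- update
DELETE FROM people WHERE id = 1;                     -- delete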

Basic SQL Concepts

import sqlite3
import pandas as pd

To illustrate the following SQL concepts, we will use the employees table with the following data:

Table: employees

id first_name last_name department salary
1 Alice Smith HR 60000
2 Bob Johnson IT 80000
3 Charlie Lee Sales 55000
4 David Kim HR 75000
5 Eva Brown IT 65000
6 Frank Wilson Sales 70000
7 Grace Taylor HR 62000
8 Henry Anderson IT 77000
9 Irene Thomas Sales 53000
10 Jack White HR 58000
11 Karen Harris IT 69000
12 Leo Martin Sales 50000
13 Mia Jackson HR 64000
14 Noah Lee IT 72000
15 Olivia Perez Sales 68000
16 Paul Young HR 61000
17 Quinn King IT 76000
18 Rachel Scott Sales 57000
19 Sam Green HR 63000
20 Tina Adams IT 81000
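
The queries below are executed from Python with pandas.read_sql_query against a SQLite connection called con. The connection itself is never shown, so here is a minimal setup sketch, assuming an in-memory database and using the sqlite3 module imported above (only the first few rows are spelled out; the remaining rows follow the table):

# Minimal setup sketch: an in-memory SQLite database holding the employees table
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT, "
    "last_name TEXT, department TEXT, salary INTEGER)"
)
rows = [
    (1, "Alice", "Smith", "HR", 60000),
    (2, "Bob", "Johnson", "IT", 80000),
    (3, "Charlie", "Lee", "Sales", 55000),
    # ... remaining rows as listed in the table above
]
con.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)
con.commit()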

SELECT and FROM

The SELECT statement is used to fetch data from a database, and the FROM clause specifies the table.

-- Selecting all columns from a table
p = """

SELECT * 
FROM 
  employees;

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
1 Alice Smith HR 60000
2 Bob Johnson IT 80000
3 Charlie Lee Sales 55000
4 David Kim HR 75000
5 Eva Brown IT 65000
6 Frank Wilson Sales 70000
7 Grace Taylor HR 62000
8 Henry Anderson IT 77000
9 Irene Thomas Sales 53000
10 Jack White HR 58000
11 Karen Harris IT 69000
12 Leo Martin Sales 50000
13 Mia Jackson HR 64000
14 Noah Lee IT 72000
15 Olivia Perez Sales 68000
16 Paul Young HR 61000
17 Quinn King IT 76000
18 Rachel Scott Sales 57000
19 Sam Green HR 63000
20 Tina Adams IT 81000
-- Selecting specific columns
p = """

SELECT 
  first_name, 
  last_name, 
  salary 
FROM 
  employees;

"""

pd.read_sql_query(p, con)
first_name last_name salary
Alice Smith 60000
Bob Johnson 80000
Charlie Lee 55000
David Kim 75000
Eva Brown 65000
Frank Wilson 70000
Grace Taylor 62000
Henry Anderson 77000
Irene Thomas 53000
Jack White 58000
Karen Harris 69000
Leo Martin 50000
Mia Jackson 64000
Noah Lee 72000
Olivia Perez 68000
Paul Young 61000
Quinn King 76000
Rachel Scott 57000
Sam Green 63000
Tina Adams 81000

Excluding and Renaming Columns

Some engines (DuckDB and Snowflake, for example) support a SELECT * EXCLUDE syntax, but SQLite does not: to leave a column out, you simply list the columns you do want. You can also rename columns with AS for clarity.

-- Selecting all but one column (leaving out salary)
p = """

SELECT 
  id, 
  first_name, 
  last_name, 
  department 
FROM 
  employees;

"""

pd.read_sql_query(p, con)
id first_name last_name department
1 Alice Smith HR
2 Bob Johnson IT
3 Charlie Lee Sales
4 David Kim HR
5 Eva Brown IT
6 Frank Wilson Sales
7 Grace Taylor HR
8 Henry Anderson IT
9 Irene Thomas Sales
10 Jack White HR
11 Karen Harris IT
12 Leo Martin Sales
13 Mia Jackson HR
14 Noah Lee IT
15 Olivia Perez Sales
16 Paul Young HR
17 Quinn King IT
18 Rachel Scott Sales
19 Sam Green HR
20 Tina Adams IT
-- Renaming columns
p = """

SELECT 
  first_name AS fname, 
  last_name AS lname 
FROM 
  employees;

"""

pd.read_sql_query(p, con)
fname lname
Alice Smith
Bob Johnson
Charlie Lee
David Kim
Eva Brown
Frank Wilson
Grace Taylor
Henry Anderson
Irene Thomas
Jack White
Karen Harris
Leo Martin
Mia Jackson
Noah Lee
Olivia Perez
Paul Young
Quinn King
Rachel Scott
Sam Green
Tina Adams

LIMIT and OFFSET

The LIMIT clause restricts the number of rows returned, and OFFSET skips rows before beginning to return rows.

-- Limiting the number of rows returned

p = """

SELECT * 
FROM 
  employees 
LIMIT 
  10;

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
1 Alice Smith HR 60000
2 Bob Johnson IT 80000
3 Charlie Lee Sales 55000
4 David Kim HR 75000
5 Eva Brown IT 65000
6 Frank Wilson Sales 70000
7 Grace Taylor HR 62000
8 Henry Anderson IT 77000
9 Irene Thomas Sales 53000
10 Jack White HR 58000
-- Skipping rows
p = """

SELECT * 
FROM 
  employees 
LIMIT 10 
OFFSET 5;

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
6 Frank Wilson Sales 70000
7 Grace Taylor HR 62000
8 Henry Anderson IT 77000
9 Irene Thomas Sales 53000
10 Jack White HR 58000
11 Karen Harris IT 69000
12 Leo Martin Sales 50000
13 Mia Jackson HR 64000
14 Noah Lee IT 72000
15 Olivia Perez Sales 68000

ORDER BY

The ORDER BY clause sorts the result set.

-- Sorting the result set by salary in ascending order
p = """

SELECT * 
FROM  
  employees 
ORDER BY 
  salary;

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
12 Leo Martin Sales 50000
9 Irene Thomas Sales 53000
3 Charlie Lee Sales 55000
18 Rachel Scott Sales 57000
10 Jack White HR 58000
1 Alice Smith HR 60000
16 Paul Young HR 61000
7 Grace Taylor HR 62000
19 Sam Green HR 63000
13 Mia Jackson HR 64000
5 Eva Brown IT 65000
15 Olivia Perez Sales 68000
11 Karen Harris IT 69000
6 Frank Wilson Sales 70000
14 Noah Lee IT 72000
4 David Kim HR 75000
17 Quinn King IT 76000
8 Henry Anderson IT 77000
2 Bob Johnson IT 80000
20 Tina Adams IT 81000
-- Sorting in descending order
p = """

SELECT * 
FROM 
  employees 
ORDER BY 
  salary DESC;

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
20 Tina Adams IT 81000
2 Bob Johnson IT 80000
8 Henry Anderson IT 77000
17 Quinn King IT 76000
4 David Kim HR 75000
14 Noah Lee IT 72000
6 Frank Wilson Sales 70000
11 Karen Harris IT 69000
15 Olivia Perez Sales 68000
5 Eva Brown IT 65000
13 Mia Jackson HR 64000
19 Sam Green HR 63000
7 Grace Taylor HR 62000
16 Paul Young HR 61000
1 Alice Smith HR 60000
10 Jack White HR 58000
18 Rachel Scott Sales 57000
3 Charlie Lee Sales 55000
9 Irene Thomas Sales 53000
12 Leo Martin Sales 50000

AND, OR, NOT

Logical operators filter records based on multiple conditions.

-- Combining conditions with AND (OR and NOT work the same way)
p = """

SELECT * 
FROM 
  employees 
WHERE 
  department = 'Sales' AND salary > 50000;

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
3 Charlie Lee Sales 55000
6 Frank Wilson Sales 70000
9 Irene Thomas Sales 53000
15 Olivia Perez Sales 68000
18 Rachel Scott Sales 57000

Numeric Operations

Perform arithmetic operations in SQL.

-- Calculating a new column
p = """

SELECT 
  first_name, 
  last_name, 
  salary, 
  salary * 1.1 AS new_salary 
FROM 
  employees;
"""

pd.read_sql_query(p, con)
first_name last_name salary new_salary
Alice Smith 60000 66000.0
Bob Johnson 80000 88000.0
Charlie Lee 55000 60500.0
David Kim 75000 82500.0
Eva Brown 65000 71500.0
Frank Wilson 70000 77000.0
Grace Taylor 62000 68200.0
Henry Anderson 77000 84700.0
Irene Thomas 53000 58300.0
Jack White 58000 63800.0
Karen Harris 69000 75900.0
Leo Martin 50000 55000.0
Mia Jackson 64000 70400.0
Noah Lee 72000 79200.0
Olivia Perez 68000 74800.0
Paul Young 61000 67100.0
Quinn King 76000 83600.0
Rachel Scott 57000 62700.0
Sam Green 63000 69300.0
Tina Adams 81000 89100.0

LIKE and NOT LIKE

Pattern matching using LIKE.

-- Pattern matching
p = """

SELECT * 
FROM 
  employees 
WHERE 
  last_name LIKE 'S%';

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
1 Alice Smith HR 60000
18 Rachel Scott Sales 57000

BETWEEN

Range filtering using BETWEEN.

-- Filtering within a range
p = """

SELECT * 
FROM 
  employees 
WHERE 
  salary BETWEEN 40000 AND 60000;

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
1 Alice Smith HR 60000
3 Charlie Lee Sales 55000
9 Irene Thomas Sales 53000
10 Jack White HR 58000
12 Leo Martin Sales 50000

OFFSET

Skip a specific number of rows before starting to return rows. Note that SQLite only accepts OFFSET together with LIMIT; a negative LIMIT removes the cap, so only the skip applies.

-- Skipping the first 5 rows (LIMIT -1 means "no upper limit" in SQLite)
p = """

SELECT * 
FROM 
  employees 
LIMIT -1 
OFFSET 5;

"""

pd.read_sql_query(p, con)
id first_name last_name department salary
6 Frank Wilson Sales 70000
7 Grace Taylor HR 62000
8 Henry Anderson IT 77000
9 Irene Thomas Sales 53000
10 Jack White HR 58000
11 Karen Harris IT 69000
12 Leo Martin Sales 50000
13 Mia Jackson HR 64000
14 Noah Lee IT 72000
15 Olivia Perez Sales 68000
16 Paul Young HR 61000
17 Quinn King IT 76000
18 Rachel Scott Sales 57000
19 Sam Green HR 63000
20 Tina Adams IT 81000

Intermediate SQL Concepts

Joins

Table: employees

id first_name last_name department_id salary
1 Alice Smith 1 60000
2 Bob Johnson 2 80000
3 Charlie Lee 3 55000
4 David Kim 1 75000
5 Eva Brown 2 65000
6 Frank Wilson 3 70000
7 Grace Taylor 1 62000
8 Henry Anderson 2 77000
9 Irene Thomas 3 53000
10 Jack White 1 58000
11 Karen Harris 2 69000
12 Leo Martin 3 50000
13 Mia Jackson 1 64000
14 Noah Lee 2 72000
15 Olivia Perez 3 68000
16 Paul Young 1 61000
17 Quinn King 2 76000
18 Rachel Scott 3 57000
19 Sam Green 1 63000
20 Tina Adams 2 81000

Table: departments

department_id department_name
1 HR
2 IT
3 Sales
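
The join examples assume the tables on con have been rebuilt to match the schema above: employees now carries a department_id, and a small departments lookup table exists. A minimal sketch of that setup (again, only the first few employee rows are spelled out):

# Rebuild the tables used in the join examples
con.executescript(
    """
    DROP TABLE IF EXISTS employees;
    DROP TABLE IF EXISTS departments;
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
        department_id INTEGER, salary INTEGER
    );
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department_name TEXT);
    INSERT INTO departments VALUES (1, 'HR'), (2, 'IT'), (3, 'Sales');
    INSERT INTO employees VALUES
        (1, 'Alice', 'Smith', 1, 60000),
        (2, 'Bob', 'Johnson', 2, 80000),
        (3, 'Charlie', 'Lee', 3, 55000);
    -- ... remaining rows as listed in the table above
    """
)
con.commit()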

Combine rows from two or more tables based on a related column.

-- Inner join example
p = """

SELECT 
  employees.first_name, 
    employees.last_name, 
    departments.department_name
FROM 
  employees
INNER JOIN 
  departments ON employees.department_id = departments.department_id;

"""

pd.read_sql_query(p, con)
first_name last_name department_name
Alice Smith HR
David Kim HR
Grace Taylor HR
Jack White HR
Mia Jackson HR
Paul Young HR
Sam Green HR
Bob Johnson IT
Eva Brown IT
Henry Anderson IT
Karen Harris IT
Noah Lee IT
Quinn King IT
Tina Adams IT
Charlie Lee Sales
Frank Wilson Sales
Irene Thomas Sales
Leo Martin Sales
Olivia Perez Sales
Rachel Scott Sales

CAST

Convert data from one type to another. (Note that SQLite uses type affinity rather than strict column types, so a cast to DECIMAL(10, 2) is treated as a cast to NUMERIC; the exact formatting of the result depends on the engine.)

-- Casting a column
p = """

SELECT 
  CAST(salary AS DECIMAL(10, 2)) AS salary 
FROM 
  employees;

"""

pd.read_sql_query(p, con)
salary
60000.00
80000.00
55000.00
75000.00
65000.00
70000.00
62000.00
77000.00
53000.00
58000.00
69000.00
50000.00
64000.00
72000.00
68000.00
61000.00
76000.00
57000.00
63000.00
81000.00

Aggregations

Perform calculations on a set of values.

-- Using aggregation functions
p = """

SELECT 
  department_id, 
  COUNT(id) AS num_employees
FROM 
  employees
GROUP BY 
  department_id;

"""

pd.read_sql_query(p, con)
department_id num_employees
1 7
2 7
3 6

GROUP BY and HAVING

Group rows that have the same values and filter groups.

-- Grouping rows and filtering groups
p = """
SELECT 
  department_id,
  COUNT(id) AS num_employees
FROM 
  employees
GROUP BY 
  department_id
HAVING COUNT(id) > 5;
"""

pd.read_sql_query(p, con)
department_id num_employees
1 7
2 7
3 6

UNION, INTERSECT, EXCEPT

Set operators combine the result sets of two queries. SQLite supports UNION, INTERSECT, and EXCEPT (the operator called MINUS in some other databases).

Table: managers

manager_id first_name last_name
1 Michael Brown
2 Sarah Johnson
3 John Lee
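
As with the other tables, managers is assumed to live on the same connection con; a minimal sketch of how it might be created:

# Create and fill the managers table used in the set-operator examples
con.executescript(
    """
    CREATE TABLE IF NOT EXISTS managers (
        manager_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT
    );
    INSERT INTO managers VALUES
        (1, 'Michael', 'Brown'), (2, 'Sarah', 'Johnson'), (3, 'John', 'Lee');
    """
)
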
-- Union example
p = """
SELECT 
  first_name 
FROM 
  employees
UNION

SELECT 
  first_name
FROM
   managers;
"""

pd.read_sql_query(p, con)
first_name
Alice
Bob
Charlie
David
Eva
Frank
Grace
Henry
Irene
Jack
Karen
Leo
Mia
Noah
Olivia
Paul
Quinn
Rachel
Sam
Tina
Michael
Sarah
John

POSITION and INSTR

Find the position of a substring. Standard SQL uses POSITION(substring IN string), but SQLite does not support it; the equivalent is INSTR(string, substring), which returns the 1-based position of the first match and 0 when there is no match.

-- Finding substring position (INSTR is SQLite's counterpart to POSITION)
p = """
SELECT 
  INSTR(first_name, 'e') AS position 
FROM 
  employees;
"""

pd.read_sql_query(p, con)
position
5
0
7
0
0
0
5
2
3
0
4
2
0
0
0
0
0
5
0
0

CASE

Conditional logic in SQL.

-- Using CASE statements
p = """
SELECT 
  first_name, 
  last_name,
       CASE 
         WHEN salary > 60000 THEN 'High'
         WHEN salary BETWEEN 40000 AND 60000 THEN 'Medium'
         ELSE 'Low'
       END AS salary_category
FROM 
  employees;
"""

pd.read_sql_query(p, con)
first_name last_name salary_category
Alice Smith Medium
Bob Johnson High
Charlie Lee Medium
David Kim High
Eva Brown High
Frank Wilson High
Grace Taylor High
Henry Anderson High
Irene Thomas Medium
Jack White Medium
Karen Harris High
Leo Martin Medium
Mia Jackson High
Noah Lee High
Olivia Perez High
Paul Young High
Quinn King High
Rachel Scott Medium
Sam Green High
Tina Adams High

Joins

Introduction

It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.

pandas has a really rich set of options for combining one or more data frames, with the two most important being concatenate and merge. Some of the examples in this chapter show you how to join a pair of data frames. Fortunately this is enough, since you can combine three data frames by combining two pairs.

Prerequisites

This chapter will use the pandas data analysis package.

Concatenate

If you have two or more data frames with the same index or the same columns, you can glue them together into a single data frame using pd.concat().

For the same columns, pass axis=0 to glue the index together; for the same index, pass axis=1 to glue the columns together. The concatenate function will typically be used on a list of data frames.
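
As a tiny sketch of the two directions (the frames here are invented purely for illustration):

import pandas as pd

top = pd.DataFrame({"x": [1, 2]})
bottom = pd.DataFrame({"x": [3, 4]})
side = pd.DataFrame({"y": ["a", "b"]})

# Same columns: stack the rows (axis=0, the default)
print(pd.concat([top, bottom], axis=0))

# Same index: glue the columns side by side (axis=1)
print(pd.concat([top, side], axis=1))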

If you want to track where the original data came from in the final data frame, use the keys keyword.

Here’s an example using data on two different states’ populations that also makes use of the keys option:

import pandas as pd
import urllib.request

base_url = "http://www.stata-press.com/data/r14/"
state_codes = ["ca", "il"]
end_url = "pop.dta"
headers = {'User-Agent': 'Mozilla/5.0'}

def fetch_data(url):
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as response:
        return pd.read_stata(response)

# This grabs the two data frames, one for each state
list_of_state_dfs = [fetch_data(base_url + state + end_url) for state in state_codes]

# Show example of first entry in list of data frames
print(list_of_state_dfs[0])
        county      pop
0  Los Angeles  9878554
1       Orange  2997033
2      Ventura   798364
# Concatenate the list of data frames
df = pd.concat(list_of_state_dfs, keys=state_codes, axis=0)

Note that the keys argument is optional, but is useful for keeping track of origin data frames within the merged data frame.
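
Because the keys become the outer level of a MultiIndex, the rows that came from a particular source frame can be pulled out directly with .loc; continuing with the df created above:

# Select the rows that originally came from the California file
print(df.loc["ca"])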

Exercise

Concatenate the following two data frames:

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])

df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])

Merge

There are so many options for merging data frames using pd.merge(left, right, on=..., how=...) that we won’t be able to cover them all here. The most important arguments are: the two data frames to be merged; the variables (aka keys) to merge on, via on= (these can be indexes); and how to do the merge (e.g. left, right, outer, inner), via how=.

The how= keyword works in the following ways:

  - how='left' uses keys from the left data frame only to merge.
  - how='right' uses keys from the right data frame only to merge.
  - how='inner' uses keys that appear in both data frames to merge.
  - how='outer' uses the union of keys from both data frames to merge.

Let’s see examples of some of these:

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
# Right merge
pd.merge(left, right, on=["key1", "key2"], how="right")
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3

Note that the key combination of K2 and K0 did not exist in the left-hand data frame, so its entries for the left-hand columns are NaN. The row still appears, though, because the keys were taken from the right-hand data frame.

What about an inner merge?

pd.merge(left, right, on=["key1", "key2"], how="inner")
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2

Now we see that the combination of K2 and K0 is excluded because it doesn’t appear in the overlap of keys in both data frames.

Finally, let’s take a look at an outer merge that comes with some extra info via the indicator keyword:

pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K0 K1 A1 B1 NaN NaN left_only
2 K1 K0 A2 B2 C1 D1 both
3 K1 K0 A2 B2 C2 D2 both
4 K2 K0 NaN NaN C3 D3 right_only
5 K2 K1 A3 B3 NaN NaN left_only

Now we can see that all key combinations from both data frames are here. The indicator=True option has caused an extra column to be added, called _merge, that tells us which data frame the keys on that row came from.

Exercise

Merge the following two data frames using the left_on and right_on keyword arguments to specify a join on lkey and rkey respectively:

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

Exercise

Merge the following two data frames on "a" using how="left" as a keyword argument:

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})

What do you notice about the position .loc[1, "c"] in the merged data frame?

For more on the options for merging, see pandas’ comprehensive merging documentation.

Introduction to SQL Joins

SQL joins are used to combine rows from two or more tables based on a related column between them. Understanding joins is crucial for effective data retrieval and manipulation. This section will cover the four main types of joins: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN, using visual aids and code snippets with a fake data frame.

Types of Joins

INNER JOIN

An INNER JOIN returns only the rows that have matching values in both tables.

-- Inner join example
p = """
SELECT 
  A.id, 
  A.name, 
  B.order_id
FROM 
  Customers A
INNER JOIN 
  Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)

LEFT JOIN

A LEFT JOIN returns all the rows from the left table and the matched rows from the right table. If no match is found, NULL values are returned for columns from the right table.

-- Left join example
p = """
SELECT 
  A.id, 
  A.name, 
  B.order_id
FROM 
  Customers A
LEFT JOIN 
  Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)

RIGHT JOIN

A RIGHT JOIN returns all the rows from the right table and the matched rows from the left table. If no match is found, NULL values are returned for columns from the left table.

-- Right join example
p = """
SELECT 
  A.id, 
  A.name, 
  B.order_id
FROM 
  Customers A
RIGHT JOIN 
  Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)

FULL JOIN

A FULL JOIN returns all the rows when there is a match in either left or right table. Rows without a match in one of the tables will contain NULL values for columns from that table.

-- Full join example
p = """
SELECT 
  A.id, 
  A.name, 
  B.order_id
FROM 
  Customers A
FULL JOIN 
  Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)
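
Note that SQLite only supports RIGHT JOIN and FULL JOIN natively from version 3.39 onwards. On older versions a FULL JOIN can be emulated, as a sketch, by taking a LEFT JOIN and appending the rows that only exist in the right-hand table:

-- Emulating a full join: left join plus the unmatched right-hand rows
p = """
SELECT 
  A.id, 
  A.name, 
  B.order_id
FROM 
  Customers A
LEFT JOIN 
  Orders B ON A.id = B.customer_id
UNION ALL
SELECT 
  A.id, 
  A.name, 
  B.order_id
FROM 
  Orders B
LEFT JOIN 
  Customers A ON A.id = B.customer_id
WHERE 
  A.id IS NULL;
"""
pd.read_sql_query(p, con)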

Example Data Frames

To illustrate these joins, let’s consider two fake data frames: Customers and Orders.

import pandas as pd

# Creating a fake data frame for Customers
customers = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Creating a fake data frame for Orders
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 4]
})

# Display the data frames
print("Customers Data Frame")
print(customers)
Customers Data Frame
   id     name
0   1    Alice
1   2      Bob
2   3  Charlie
3   4    David
print("Orders Data Frame")
print(orders)
Orders Data Frame
   order_id  customer_id
0       101            1
1       102            2
2       103            2
3       104            4
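
For the join queries above to run against the SQLite connection, the two frames are assumed to have been written to con as tables; a minimal sketch using pandas' to_sql:

# Write the frames to the SQLite connection as tables named Customers and Orders
customers.to_sql("Customers", con, index=False, if_exists="replace")
orders.to_sql("Orders", con, index=False, if_exists="replace")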

Conclusion

SQL is an indispensable tool for anyone working with data. Its ability to manage, manipulate, and integrate data makes it essential for data analysis and decision-making. The examples provided in this chapter illustrate some of the key operations and techniques used in SQL, highlighting its importance in the field of data science.
